SUGGESTIONS FOR THE NEXT REVISION OF THE BINET-SIMON SCALE


GRACE H. KENT


I. INTRODUCTORY STATEMENT 


The new Stanford-Binet test (1), watched for hopefully over a 
period of several years, has at length made its appearance. It is a 
safe assumption that for the next twenty years we shall have to 
accept it for better or worse, because no student of the present genera- 
tion is likely to obtain such financial backing as would make possible 
a further revision with standardization on a large scale. This is 
therefore the favorable time to consider the principles upon which 
such revision should be based in the rather remote future. It may 
be assumed that no test-standardization project now in progress is so 
near completion as to render our suggestions wholly futile. The field 
is clear for constructive criticism, as it has not been during these 
years while we have been waiting for the new Stanford-Binet. 

No test system should be accepted as final. Test questions, how- 
ever carefully framed, show a tendency to pass out of date. Only the 
exceptional problem turns out to be equally significant for all times 
and places. In order to find sufficient material for an adequate test, 
we may have to include problems which are of special local or tem- 
poral significance. Standardization of a test is of itself unfavorable 
to frequent revision, and it is therefore important that each test item 
should be thoroughly tried out by many persons before being standard- 
ized. The test of the future should be the product of many minds, 
based upon the accumulated experience of a generation. 

The new Stanford-Binet scale is of course a much stronger test 
than the 1916 edition (2), especially quantitatively. On the quali- 
tative side it should be mentioned with approval that the blood- 
curdling absurdities of the older edition have given place to state- 
ments which are emotionally neutral and yet interesting; also that the 
memory test has been greatly improved by substituting a single sen- 
tence of suitable length for two sentences combined in one task. It is 
especially with reference to the set-up of the scale that the authors 
have failed to take advantage of the contributions published since 
1916. The criticisms here expressed, based upon upwards of two 
thousand clinical examinations made by the 1916 Stanford-Binet scale
during the years 1918-30, refer primarily to the features which are 
retained essentially unchanged in the new edition. 

This is offered as a personal opinion which may be somewhat dis- 
counted because of the writer’s deviations from accepted standards 
concerning the use of mental tests, especially as follows: 

1. Rejection of the “IQ” as derived by Stern and Terman, ex- 
cept for the child who tests approximately at age. 

2. Unwillingness to use the term “intelligence” as applied to any- 
thing that can be measured. 

3. Opposition to any rating by any test, as such, for diagnosis 
of mental deficiency. 

With increasing experience in the use of tests the writer has been 
losing confidence year by year in the propriety of evaluating individ- 
ual capacity and achievement by any statistical method, including 
those tests developed personally for personal use. This does not mean, 
however, that all standardized tests are to be equally condemned. One 
may strongly disapprove the current use of standardized tests while 
yet recognizing that some tests are stronger than others and that some 
tests lend themselves very readily to misuse. It is a serious offense 
on the writer’s part to have published at least one test which is highly 
susceptible to gross abuse, but no apology is offered for criticising 
certain details and items of the Stanford-Binet scale instead of attack- 
ing statistical diagnosis in its entirety. Diagnosis by statistical tables 
is a fact, however regrettable a fact; and therefore the nature of the
tests used in diagnosis is very far from being a matter of indifference. 

It is freely conceded that the Stanford-Binet scale is presumably a 
much better tool in the hands of some clinical workers than in the 
writer’s hands. Whatever the intrinsic merits of a test system, the 
personal reaction of the examiner is too important a factor to be dis- 
regarded. It certainly has a marked effect upon the examiner’s satis- 
faction with an examination, and it may in some cases affect the actual 
results. An examiner who likes this particular instrument can doubt- 
less use it to better advantage. 


II. THE AGE-GRADE METHOD OF EVALUATION 


The basic construction of the Binet-Simon scales—the age-grade 
or year-scale method used by Binet and followed by Terman—is need- 
lessly cumbersome and uneconomical as compared with a scoring sys- 
tem in which the responses are evaluated by points. Unconditional 
acceptance or rejection of a response is unfair to the subject whose 
response just barely misses being acceptable. The fact that a given
problem can be solved by seventy per cent of the normal children of 
a given age may prove the suitability of the problem for that par- 
ticular mental level, but it does not necessarily render that problem 
useless at all other ages or levels. Nor does recognition of different 
degrees of acceptability necessarily make the evaluation more sub- 
jective. At least for multiple items—including all items for which 
two or more correct responses are required before the subject can 
receive credit at the specified mental level—the recognition of partial 
credits would be quantitative rather than qualitative and would obvi- 
ously be more fair to the subject. In some instances even qualitative 
differences might be recognized without increasing the subjectivity 
of evaluation. 

Within the year-scale system, the subject matter of the Stanford- 
Binet scale is used with unnecessary wastefulness. When a child re- 
sponds correctly to only three of the five absurdity questions, he of 
course fails to achieve ten-year credit on that item; but might he not 
be given nine-year credit for three acceptable responses, and possibly 
eight-year credit for two responses? We who have not had access to 
the sources do not know the mental-age value of anything less than 
four responses, but the test could hardly have been standardized with- 
out establishing such values for one, two and three responses. Why 
should we be denied the use of these additional values? Why have 
we been given no nine-year value for the vocabulary test, an item 
which must of necessity be presented at the nine-year level and inci- 
dentally the item which requires more time for presentation than any 
other item at that level? If twenty words (correctly defined) repre- 
sent a mental age of eight years and thirty words a mental age of 
ten years, there must be between twenty and thirty some score which 
represents the achievement of the nine-year children from whom the 
norms were derived. Inasmuch as the vocabulary test is recognized 
as being an exceptionally strong item, the scale would be strengthened 
at the nine-year level by including this item. The time for presenting 
the scale would be appreciably reduced by omitting some less signifi- 
cant item, such as rhyming of words or arrangement of weights; and 
the item thus omitted might still be useful occasionally as an alternate. 
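
The point about the missing nine-year vocabulary value can be illustrated with a minimal sketch. It assumes, purely for illustration, that the norm rises linearly between the two published anchor points (twenty words at eight years, thirty words at ten years). The true nine-year value could be fixed only from the standardization records, which are not available to us; the function name and the linear rule are assumptions and not part of the scale.

```python
def interpolated_mental_age(words_defined, low=(20, 8.0), high=(30, 10.0)):
    """Estimate a mental age (years) between two published anchor scores.

    ASSUMPTION: the norm is linear between the anchors. The real values
    would have to come from the original standardization records.
    """
    (low_score, low_age), (high_score, high_age) = low, high
    if not low_score <= words_defined <= high_score:
        raise ValueError("score lies outside the published anchor points")
    fraction = (words_defined - low_score) / (high_score - low_score)
    return low_age + fraction * (high_age - low_age)


# Under the linear assumption, 25 words falls midway between the anchors.
print(interpolated_mental_age(25))  # 9.0
```

A score table derived from the actual nine-year children would replace the linear rule; the sketch shows only that some intermediate value must exist.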

But even if we were permitted the use of these unpublished score 
values for multiple items, the Stanford-Binet scale would still be 
wasteful of time. The year-scale method is essentially an uneco- 
nomical scoring system for a subject of whatever age, because each 
examination includes so many items which add nothing to its adequacy.
It is strictly required that the examination be carried low
enough in the scale to yield six correct responses at a given age-level; 
and that it be carried high enough to elicit six consecutive failures. 
Thus there are at least twelve items (some of them multiple items 
requiring considerable time for presentation) which are frankly non- 
discriminative for the particular subject examined; and if so unim- 
portant an item as repetition of digits be the only success or the only 
failure at a given level, we may count five additional items as being 
essentially non-discriminative for the subject, thus including seventeen 
items which are used merely to satisfy a formal requirement. 


On no account would it be permissible for the examiner to omit 
any of these useless questions on the ground that the subject’s response 
may be predicted with reasonable confidence. Every experienced 
examiner knows that whether a subject will or will not be able to 
answer a given question is not a matter for conjecture. After the 
norms of a test have once been established according to a given method 
of presentation and scoring, strict adherence to that method is essen- 
tial. The Stanford-Binet scale might have been made far more con- 
venient and less time consuming, but now it must be accepted as it 
stands. 

This should not be taken as a criticism of the length of time 
required for a psychometric examination, which is indeed far too short 
a time for passing upon the mental capacity of a child. It is not for 
the purpose of shortening the examination that the non-discriminative 
material should be reduced to the minimum, but rather to gain time 
for the use of additional tests not included in the scale. Time used 
for presenting questions which are clearly above or below the sub- 
ject’s mental level may fairly be counted as wasted whenever it can be 
shown that a different method of presenting the norms would permit 
these inappropriate questions to be omitted without impairing the 
adequacy of the examination. 

Furthermore, the time thus wasted in a psychometric examination 
is much worse than wasted. The typical clinical subject over ten 
years of age is usually sensitive about being brought to the clinic at 
all. The items which are most annoying to a self-conscious child or 
adolescent are these questions at the upper and lower ends of his 
natural range. The too-difficult questions are humiliating to the sub- 
ject, and the too-easy ones are insulting. The younger child is espe- 
cially disconcerted by the questions which he cannot answer, and it is 
the adolescent subject who is more likely to feel insulted by a question 
which can be answered without a moment’s thought. In many cases
a child’s reaction to this non-discriminative material is so strong as 
to affect his attitude toward examination and examiner, thus making 
it difficult or impossible to obtain the full cooperation upon which a 
valid rating depends. It is true, of course, that the subject's range 
cannot be properly covered by the test without including some matter 
both above and below his level; but it is nothing less than inhuman to 
use a larger amount of inappropriate material than is necessary to the 
adequacy of the examination. 


III. OTHER BINET-SIMON REVISIONS 


There is little to be said concerning the earlier and later American 
editions of the Binet-Simon scale. No other edition has been so 
widely used as Stanford-Binet, nor has any other been legally recog- 
nized as having diagnostic significance in cases of mental deficiency. 
Detailed analysis is therefore unnecessary. 

The criticism concerning excessive use of non-discriminative mate- 
rial applies also to the Goddard-Binet scale (3) which has been super- 
seded by the Stanford-Binet, and to the Kuhlmann-Binet scale (4) 
which is both more adequate and more time-consuming than the 1916 
Stanford-Binet. The Yerkes-Bridges Point Scale (5) represents an 
honest and commendable attempt to give us a more equitable system 
of evaluation than that of the age-grade scales; but it is not less waste- 
ful of time nor less rigid in application. 

It was Herring (6) who appreciated the importance of reducing 
the non-discriminative matter to be included in a given examination 
and who endeavored to give us a point scale better adapted to the 
somewhat crude conditions of our clinics. The scale consists of thirty- 
eight items, the easy and difficult questions being well distributed 
throughout the series. Independent norms are offered for five differ- 
ent stages of completeness, by reason of which it is possible to give the 
subject a tentative rating derived from a small group of questions 
while leaving the examination to be completed at a later date. Further- 
more, special provision is made for the omission of certain items which 
are assumed—specifically on the strength of success or failure in cer- 
tain other items—to be non-discriminative for the subject, credit for 
the omitted items to be given or denied in accordance with fixed rules 
which were followed in the standardization. 

The writer acknowledges considerable indebtedness to Herring in 
the early development of the “Emergency Test” (7). Anyone who 
uses the Herring scale can see that there is much to be learned from it, 
although it is obvious that the items were selected without anything 
approaching adequacy of preliminary try-out. One need not approve
the actual substance of the scale, the flat-rate scoring system or the
method of standardization to recognize Herring’s work as a contribution
of real value.


IV. THE PINTNER-PATERSON PERFORMANCE SCALE 


For the fundamental construction of a test system, the plan given 
us in 1917 by Pintner and Paterson (8) seems to the writer far and 
away the most serviceable set-up yet offered, for either language tests 
or performance tests. This “scale”, which might more appropriately 
have been designated a collection of performance tests, includes fif- 
teen unrelated tasks which were standardized as independent units. 
From this series of tasks the examiner may select for a given examina- 
tion any units that are passably well suited to the ability of the sub- 
ject, omitting those tasks which are either too easy or too difficult to 
be discriminative for this particular subject. Inasmuch as each task 
yields an independent “mental age” rating, the median rating of the 
series of units used in a given examination may be recorded as the 
subject’s “mental age” as determined by this series of tests. 
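
The median-rating procedure described above may be sketched as follows. The unit names and ratings are hypothetical and stand for whichever tasks the examiner has judged suitable for the subject; only the use of the median is taken from the text.

```python
from statistics import median


def battery_rating(unit_ratings):
    """Return the median of the independent unit ratings (mental age, years)."""
    if not unit_ratings:
        raise ValueError("at least one unit rating is required")
    return median(unit_ratings.values())


# Hypothetical ratings for five units selected as suitable for one subject.
ratings = {
    "unit_a": 8.5,
    "unit_b": 9.0,
    "unit_c": 9.5,
    "unit_d": 10.0,
    "unit_e": 12.0,
}
print(battery_rating(ratings))  # 9.5
```

Units omitted as too easy or too difficult are simply left out of the mapping, so that the rating depends only on the tasks actually presented.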

Some of the form boards included in this scale are evaluated inde- 
pendently by speed and by errors or moves, so it is possible to count 
one task as two units of the series. The examiner may lose count of 
the moves and still have the speed rating to be used as one unit; or he 
may lose the time record and yet have the task represented in the 
series by the move-count score. 

These authors did not presume to smooth the age-curves for pub- 
lication, but presented in its crude and natural state such material as 
they could collect. Thus the student who uses the tests may construct 
norms for his own use, with full knowledge of the nature and degree 
of each deviation from the natural age-curve. Furthermore, the data 
are presented in so detailed a form that anyone who has collected 
records from normal children may combine his own findings with the 
published data for the purpose of strengthening the norms. 

It was probably by accident that the writer discovered the value 
of so elastic a system as the Pintner-Paterson scale—the accident of 
having only a few scattered units of the scale instead of a full set. 
The performance test equipment of Worcester State Hospital in 1922 
consisted of four Pintner-Paterson tests and a scroll saw. The five 
form boards of the scale were added one at a time, each being pressed 
into service before receiving its final coat of varnish. It was a matter 
of necessity to use the scale in part rather than in its entirety, and a 
short series of performance tests seemed obviously better than none at
all (9). Gradually it became apparent that the system admitted of
expansion no less than of contraction; and that various other performance
tests, such as the Ferguson form boards (10), the Kohs
Block Design test (11) and the Young Slot Maze (12), were far more
valuable when used as new units of this system than when used as 
isolated tests. 

At the present time the writer’s working psychometric outfit con- 
tains one sole survivor of the Pintner-Paterson scale—the Mare and 
Foal board, the norms of which have been extended to the four-year
level by the addition of the Stutsman data (13). The fact that the 
system has been in constant use during the time when so many units 
have been gradually superseded is offered as proof that the system 
itself is very much alive. 

The foundation laid by Pintner and Paterson is worthy of a vastly 
better structure than has yet been raised upon it. 


V. COMPOSITE TESTS AND BATTERIES

The term “battery” has been applied rather loosely to a series of 
mental tests. Without known authority for a more specific mean- 
ing and without knowing the source of the term as applied to tests 
at all, the writer has taken the liberty to use it with reference to a 
series of test units having a common denominator, the end-result of 
the series being expressed as a median rating rather than a rating 
based upon a composite score or upon the average of several ratings. 

It is recognized that almost any battery unit must be more or less 
composite in nature, but the term “composite test” is used here as 
applied to any system in which all the items are evaluated as a whole, 
the entire scale, possibly the entire examination, yielding only one
rating. A series of wholly independent tests would become a com- 
posite in effect if the results were reduced to one figure by using the 
average of the ratings instead of the median. 

The distinctive feature of the battery, according to this definition, 
is that those tests which lie in the middle of the subject’s range of 
achievement are weighted in determining his rating. The tests yield- 
ing the ratings at the ends of the series are reduced in value as com- 
pared with the middle ratings, on the ground that they may be weak 
in discriminative capacity for this subject, that they may measure 
special aptitudes in which he is notably strong or weak, or that his 
achievement in these tasks may have been influenced by some unknown 
chance factor. These exceptionally high and low ratings still retain 
the strength of their position in the series, but their actual score values
are not used as such in determining the final rating. These end ratings 
must not be considered too significant, because a well-balanced battery 
would naturally show a rather wide spread of individual ratings. It 
may be assumed that the tests yielding the middle ratings are probably 
more representative of the subject’s mental capacity in general, and 
that (if the results must be expressed numerically in a single figure) 
the median rating or middle group of ratings may be accepted as af- 
fording the safest estimate of what the series of tests is intended to 
measure. 
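
The difference between the battery and the composite in their treatment of the end ratings can be shown in a minimal sketch. The two series below are hypothetical and differ only in the highest rating; the helper name is an assumption introduced for illustration.

```python
from statistics import mean, median


def summarize(ratings):
    """Return the median and the arithmetic mean of a list of ratings."""
    return {"median": median(ratings), "mean": mean(ratings)}


# Hypothetical "mental age" ratings in years from five independent units.
typical = [8.0, 9.0, 9.5, 10.0, 11.0]
# Same series, except that the highest rating is far more extreme.
with_extreme = [8.0, 9.0, 9.5, 10.0, 16.0]

print(summarize(typical))       # {'median': 9.5, 'mean': 9.5}
print(summarize(with_extreme))  # {'median': 9.5, 'mean': 10.5}
```

The extreme rating keeps its position at the end of the series, but its score value moves only the average and leaves the median where it was.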


Kuhlmann recommends using an even number of tests in the 
series, “so that the exact median will be determined by two tests 
instead of one.” (14) However, if the common denominator be the 
“mental age”, a rating reported in months or fractions of a year sug- 
gests a degree of exactitude which is potentially misleading. The 
writer formerly made a practice of taking the average of the middle 
half of the ratings in the series; but later adopted the median rating 
from an odd number of tests, specifically to avoid implying the claim 
that we can determine a person’s “mental age” to the month. This is 
perhaps a matter for personal preference. In a test intended for 
classification of school children of a given grade, it may be advan- 
tageous to evaluate the results so as to show the finest distinctions: 
but for the individual clinical subject it is more important (or at least 
it seems more important to one who challenges the propriety of con- 
verting the “mental age” into an “intelligence quotient” by any means 
now available) that the rating be clearly recognized as the approximate 
estimate which it really is. 
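
The three ways of reducing a series to one figure that are mentioned above may be compared in a short sketch. All ratings are hypothetical. The rule for the "middle half" (dropping a quarter of the ratings, rounded down, from each end) is an assumption, since the text does not say how series of other lengths were handled.

```python
from statistics import mean, median


def middle_half_average(ratings):
    """Average the ratings left after dropping a quarter from each end.

    ASSUMPTION: the number dropped per end is len(ratings) // 4.
    """
    ordered = sorted(ratings)
    trim = len(ordered) // 4
    return mean(ordered[trim:len(ordered) - trim])


even_series = [8.0, 9.0, 10.0, 11.0]       # median rests on two tests
odd_series = [8.0, 9.0, 10.0, 11.0, 12.0]  # median is one observed rating
eight_units = [7.0, 8.0, 9.0, 9.0, 10.0, 10.0, 11.0, 13.0]

print(median(even_series))               # 9.5
print(median(odd_series))                # 10.0
print(middle_half_average(eight_units))  # 9.5
```

With an even number of tests the result may be a fractional value that no single test yielded, which is the appearance of exactitude the odd-numbered series avoids.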


Other things being equal, the battery possesses three outstanding 
advantages over the fixed composite scale which is standardized as a 
whole. There is at this time no battery available which possesses all 
these advantages or which can be substituted in full for the scales 
given us by Terman. But the potential advantages of the battery are 
incalculable. 

1st. The battery of independent units can be adapted to the
individual subject. The importance of flexibility in a clinical test will 
be considered at length in Section VIII. 

2nd. The battery is a growing system, one which permits prac- 
tically unlimited growth. 


A thoroughgoing revision of the Binet-Simon scale is too costly a 
project to be undertaken more than once or twice in a generation; 
also, the standardization requires so many years that there is time for
items to lose something of their significance before the test is ready to 
be used. We cannot depend upon such a scale as Stanford-Binet 
without being forced to make considerable use of out-dated material. 

The battery, on the other hand, is so loosely constructed that any 
unit can be revised or superseded without disturbing the system as a 
whole. It is at all times in process of being rejuvenated. A unit 
which is no longer useful may be dropped from the series, and any 
new test which has been standardized independently may be intro- 
duced into the system as an additional unit. To go back to the 
Pintner-Paterson scale for illustration: the scale included the first 
Healy Pictorial Completion, a very crude picture test which has long 
since been superseded by Pictorial Completion II. (15) The later test 
may be substituted for the earlier one without affecting the rest of 
the system. Any other obsolete unit may be discarded in favor of 
any new unit now available. 


3rd. Inasmuch as the material for a battery can be built up grad- 
ually, there is abundant opportunity for any student who is working 
on tests to make his personal contribution. The development of the 
system unit by unit rather than as a complete scale is highly favor- 
able to the use of material drawn from many sources. 

The establishment of an adequate test system is a project for col- 
lective rather than individual effort. In the early stages it is desirable 
that the workers be widely scattered, so that children from every 
section of the country may be represented in the first tentative norms 
of some test. Division of labor among a large group of students is 
favorable also to the careful and thorough preliminary work that is 
so important for preventing the establishment of false standards. Each 
item considered for standardization ought to be well tried out in 
clinics, preferably by persons not too close to the source. The author 
of a test is little better qualified to appraise its true value than to pass 
upon the real merits of his own children. No test can safely be con- 
sidered ready for large-scale standardization until it has been subjected 
to the critical analysis of unbiased persons who have made considerable
actual use of it. Once again we are indebted to Pintner and
Paterson for pointing out a way of utilizing the contributions of 
others. Being in need of tests that could be used for deaf children, 
they collected for standardization whatever non-language tests could 
then be found, adding a few tests of their own to fill in the gaps but 
taking advantage as far as possible of the work already done by others. 

In the 1937 year book of the Psychological Association there are
557 persons listed as being engaged in research on tests, including 197 
persons who occupy clinical positions or who offer other evidence of 
clinical interests. It is a safe guess that at least 100 students are 
working—possibly in a small way—on the development of mental tests 
which have been framed to meet their own personal needs in clinical 
examinations. If it cannot be expected that all tests thus developed 
will be found generally serviceable, at least it may be assumed that 
the total output would permit considerable weeding out and yet yield 
useful material far more varied than that now possessed by any single 
clinic. Publication of preliminary reports on such tests would not 
only serve to make the tests available to others but would also enable 
the authors to take advantage of the criticisms of other students. The 
premature publication, even of those tests which are not destined for 
survival, is not likely to do more serious harm than that of adding a 
bit to the labors of the student who is held responsible for working up 
an exhaustive list of references on tests. The defective test published 
as a detached unit is relatively innocuous, as compared with the de- 
fective item which is incorporated into a composite scale. At worst, 
the use of isolated units is not forced upon any examiner who recog- 
nizes their inherent weakness and who prefers not to use them; at 
worst, they do not serve to invalidate a system which—if used at all— 
must be used as a whole. 

It is true that centralized effort has its place in the establishment 
of a test system. The time may come when it would be desirable to 
have a single university take over the material developed in the rough 
by numerous widely scattered students, so as to weld it into a 
coordinated whole. Undoubtedly it would be desirable some time to 
use a given group of subjects in the standardization of a given group 
of tests, for determination of the intercorrelation among the several 
units. But the advantage of having a system ultimately coordinated 
would be in no way imperiled by building it up out of detached units 
which have already had abundant opportunity to prove their worth. 

It is the battery of unrelated units, rather than the elaborate com-
posite scale, that opens the way for a large number of workers to 
have a share in this hypothetical test system of the future. The bat- 
tery-idea encourages each student to concentrate his efforts upon the 
single unit or small group of units which he is prepared to bring to 
the highest degree of development permitted by his resources. 


VI. BATTERIES, ACTUAL AND POTENTIAL


It is among group-presented tests intended primarily for classifica- 
tion of school children that the battery has been most used. 

The Kuhlmann-Anderson series of school tests (14) includes 39 
independent units, with picture tests at the lower age levels and writ- 
ten tests for the higher grades, the series being so graded in difficulty 
as to be discriminative from six years up to the adult level. The first 
ten tests are used for first grade and the last ten tests for high school; 
and each intervening grade is covered by a ten-test series appropriate 
for the mental level of that grade. The test forms are available both 
in booklets intended for group presentation in school and also in single 
sheets which are more convenient for individual examination. 

The value of this system which Kuhlmann and Anderson have de- 
veloped is something which cannot easily be over-estimated. Their 
series offers a plan which should receive the most careful consideration 
by anyone who is interested in the development of tests for clinical 
use. A few of these units, as they stand, are useful in the clinic as
a check on other tests. It cannot be claimed, however, that the series 
offers very much material that is adapted to the needs of the clinic. 
The time limits at the lower end of the series are so exacting as to be 
irritating to the subject; the pictures are for the most part too crudely 
drawn; and the pages are too small and crowded. Furthermore, the 
range of discriminative capacity of each unit is adapted to the child 
known to be in a certain school grade rather than to the clinical sub- 
ject whose mentality is a matter of conjecture. In order to find the 
appropriate tests for an adolescent subject of whom nothing is known, 
the examiner must try one unit after another, sometimes using enough 
tests to obtain a perfect score at one end of the series and a zero score 
at the other. This is exactly the procedure which makes the age-grade 
scales so difficult to use in the clinic and so needlessly wasteful of 
precious time. 

Another school test standardized as a battery is Baker’s Detroit 
Primary Intelligence Test (16), intended for grades II to IV. For 
this very limited range of mental levels, approximately 7 to 10 years, 
these seven tests are highly serviceable. Norms are presented for each 
unit individually and also for the composite score. The time limit, if 
any, is for an entire page. There is one page of reading matter, which 
must be omitted for the subject who cannot read. This one unit 
would make the test almost useless in the clinic if the norms had been 
given only for the composite test. 

There are many school tests, both written tests and picture tests,
which could be used in the clinic if we had separate norms for each 
unit. Such tests as Army Alpha (17), National Intelligence Test 
(18), Terman Group Test (19) and Myers Pantomime Test (20) are 
made up in form suitable for standardization as batteries, each page 
being a unit with reference to the timing and the instructions. These
tests, and many others like them, contain some units which would be 
serviceable in individual examinations. 

Group standardization does not necessarily spoil a test for indi- 
vidual presentation. It is largely because of their time limits that 
school tests are of so little use in clinics. With unlimited time allow- 
ance, some of the tests greatly needed in the clinic could be standard- 
ized as adequately by group examination as by individual presentation. 
The Kent-Shakow written battery (21) was standardized (inade- 
quately) without time limit, in order to have a few written tests on 
which the clinical subject might be permitted to take his own time. 
Many more such tests are desperately needed, especially written tests 
adapted to third-grade reading ability and picture tests for illiterate 
subjects of the higher mental levels. 

The need of more and more performance tests is widely recognized 
by clinical workers. The ideal psychometric outfit for the clinic would 
include a battery of at least seven non-language units, each unit being 
a graded series of tasks. These performance tests, however, should 
not be evaluated in the same battery with language tests. Manipula- 
tive tests which depend upon skill or special aptitude or experience 
should be evaluated either as independent units or in a battery con- 
sisting wholly of non-language tests. They constitute a very im- 
portant part of a clinical examination, but they involve too much 
chance to justify including them with language tests. 

All tests depending upon the use or understanding of language 
may be classed as language tests and may be included in the battery 
from which the median rating is derived. The units of a battery 
should be sufficiently varied to yield a rather wide spread of ratings 
for almost any subject; and a group of at least three similar ratings 
at the center of the series is needed as proof that the median rating 
is not determined by chance. The number of units required for an 
adequate examination is variable. 
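
By way of illustration, the median rule just stated may be sketched in Python. The sketch assumes that each unit of the battery yields one rating in years of mental age. The tolerance of one year used here to define "similar" ratings is an assumption of the sketch, not a figure given in the text, and the function names are hypothetical.

```python
from statistics import median


def median_rating(ratings):
    """Median of the independent unit ratings of one battery."""
    return median(ratings)


def central_cluster_ok(ratings, tolerance=1.0, needed=3):
    """True if at least `needed` ratings lie within `tolerance` of the median.

    `tolerance` (in years of mental age) is an assumed definition of
    "similar"; the text does not fix a figure.
    """
    m = median(ratings)
    close = [r for r in ratings if abs(r - m) <= tolerance]
    return len(close) >= needed


# Five unit ratings with a group of three near the center.
battery = [7.0, 8.5, 9.0, 9.5, 12.0]
print(median_rating(battery), central_cluster_ok(battery))
```

A battery which fails the check would call for additional units before the median is reported, which is one reading of the remark that the number of units required is variable.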


VII. THE GRADED SERIES IN THE TEST SYSTEM

Both for accuracy of standards and for convenience of presenta- 
tion, it is desirable that each battery unit be a graded series consisting 
of tasks essentially similar in kind but differing in degree of difficulty. 
It is unduly arbitrary to use wholly different methods for meas- 
uring “intelligence” at two adjacent age levels. The manifestations 
of intelligence in a child of six and those in a child of seven have much 
more in common than have the test items for those two ages in the 
age-grade scales. Inasmuch as there is so much overlapping in what 
we wish to measure, it is only natural that there should be considerable 
overlapping in the methods used for measuring it. Any test which 
possesses sufficient intrinsic merit to justify including it at all should 
be so graded as to cover those levels for which it can appropriately 
be used. If the ability to copy a square be significant at four years 
and the ability to copy a diamond at seven years, there must be other 
geometric forms the copying of which would be significant at other 
age levels. (One of the Kuhlmann-Anderson units is a page of draw- 
ings to be copied, intended for use in grades IV to VI.) 

At each age it is possible to introduce some new test that could 
not appropriately be used at a lower level. Such a test, if it measures 
an aptitude that is worth measuring, should be carried along for sev- 
eral years with tasks of increasing difficulty. Beginning at nine or ten 
years it would be appropriate at each year to drop out one of the 
units brought up in a continuous series from the lower levels. Thus 
it would be possible to have the discriminative capacity of the system 
rising gradually to higher levels, without sharp transition between any 
two adjacent levels. 

There is not much room in a battery system for detached items 
which cannot be graded and which are useful at only one mental 
level. However, a few such items that are considered too valuable 
to be discarded might be made up into a short composite point scale to 
be used as one unit of a battery. 

The most valuable tests are those which can be graded over the 
widest range of mental levels. The range of ages for which a test is 
satisfactorily discriminative can easily be determined by taking a few 
preliminary records, after which the standardization should in all 
cases be carried one year higher and one year lower than the range 
for which the test is to be used. The levels at the extreme ends of the 
standardization are not properly covered by the test, and no “norms” 
should be offered for those levels except with reservations. 
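
The rule of the one-year margin may be put as a small Python sketch. The function names are hypothetical, and treating the margin as exactly one year at each end is simply the rule as stated above. Ages lying inside the standardization range but outside the range of use are the ones for which norms are to be offered only with reservations.

```python
def standardization_range(use_low, use_high, margin=1):
    """Ages to be covered in standardization for a given range of use."""
    return use_low - margin, use_high + margin


def norm_is_provisional(age, use_low, use_high):
    """True for the extreme ages that are standardized but not properly covered."""
    std_low, std_high = standardization_range(use_low, use_high)
    in_standardization = std_low <= age <= std_high
    in_use_range = use_low <= age <= use_high
    return in_standardization and not in_use_range


# A test to be used for the ages eight to thirteen.
print(standardization_range(8, 13))
```

For a test used from eight to thirteen years, the sketch gives a standardization from seven to fourteen, with the norms at seven and at fourteen marked as provisional.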

Any language test which covers a wide range of ages may be 
divided into sections for standardization, thus reducing to the min- 
imum the labor of standardizing a test beyond the ages for which it is 
to be used. This may be expressed diagrammatically: 

                E   F   G   H   I
        C   D   E   F   G
A   B   C   D   E
6   7   8   9   10  11  12  13  14


The letters A to I represent items of increasing difficulty in a test 
which is discriminative for the ages 6 to 14 years. These items are 
to be evaluated by points, weighted empirically, the weighting of a 
given item to be the same for all three sections but the norms for each 
section to be independent. The lower section is to be standardized for 
the ages 6 to 10, the middle section for the ages 8 to 12, and the upper 
section for the ages 10 to 14. 
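
The arrangement described above may also be written out as a short Python sketch. The item letters, sections and age ranges are those of the diagram. The point weights are hypothetical placeholders, since the text requires only that the weighting be empirical and that a given item carry the same weight in every section. No norms are shown, as these would have to come from actual standardization.

```python
# Items A to I in order of increasing difficulty, as in the diagram.
SECTIONS = {
    "lower": {"items": "ABCDE", "ages": (6, 10)},
    "middle": {"items": "CDEFG", "ages": (8, 12)},
    "upper": {"items": "EFGHI", "ages": (10, 14)},
}

# Hypothetical weights: one fixed weight per item, shared by all sections.
WEIGHTS = {item: points for points, item in enumerate("ABCDEFGHI", start=1)}


def sections_for_age(age):
    """Names of the sections whose standardization covers this age."""
    return [name for name, s in SECTIONS.items()
            if s["ages"][0] <= age <= s["ages"][1]]


def section_score(name, passed):
    """Points earned on one section; items outside the section are ignored."""
    return sum(WEIGHTS[i] for i in SECTIONS[name]["items"] if i in passed)


def common_items():
    """Items shared by every section."""
    shared = set("ABCDEFGHI")
    for s in SECTIONS.values():
        shared &= set(s["items"])
    return sorted(shared)


print(sections_for_age(10), common_items())
```

Each section score would then be referred to the norms of that section alone, which is what keeps the three sets of norms independent.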

The presentation of a test thus standardized is extremely simple 
and convenient. In presenting it to a subject of unknown mentality 
or school achievement, the examiner would begin with item E, which 
is common to all three sections; and would be guided by the results 
in deciding whether to go up or down. There is no need of present- 
ing any item which is not actually to be used in the results. 

This plan of division into sections for standardization is for lan- 
guage tests, not for performance tests. In any series of performance 
tasks which are approximately similar in kind, the effect of practice 
is so strong a factor as to make it necessary for each subject to start 
at the bottom of the series. 


VIII. PSYCHOMETRIC PROBLEMS OF THE CLINIC 


The great need of the clinic is a test system that can be adapted 
to the individual subject. 

The subjects studied in some clinics—including court cases occa- 
sionally referred for psychometric examination—range in age from 
two years to sixty. We have occasion to examine pre-school children 
who are afraid of strangers, school children who cannot read but who 
appear otherwise normal, children whose speech is so defective as to 
be almost or wholly unintelligible, adult immigrants who understand 
very little English, and elderly persons who are not provided with 
glasses. Our cases are not made to order, with special reference to the 
measurability of their mentality by means of the test known as the 
“accepted standard”. We have to take them as they come. 

It is impossible that the Stanford-Binet scale or any other inflex- 
ible system should be suitable for cases of unusual types. Unfor- 
tunately, these are frequently the cases in which there is something 
serious at stake. When a child of reading age is referred to the clinic 
because of his failure to learn to read, it is of the first importance to 
ascertain whether his mental capacity is or is not within normal 
limits. A composite test which contains reading matter for the sub- 
ject discriminates against the subject whose inability to read is due to 
any cause other than mental retardation. A test which calls for oral 
response discriminates very seriously against the child who by reason 
of speech defect or impediment is unable to make himself understood. 
It is little more than a farce to use a timed test or a test containing 
timed items for a psychotic subject whose mental processes are path- 
ologically slowed up. What we measure by the test may be signifi- 
cant, but it is something quite other than what the test is intended to 
measure. 

Even when there is no special disability to invalidate the results, 
we can hardly expect to gain the full cooperation of an unwilling 
subject unless we are prepared to show some consideration for his 
likes and dislikes. Clinical subjects are usually brought to the clinic 
contrary to their own wishes, and it is the responsibility of the ex- 
aminer to overcome their antagonism by arousing some interest in the 
examination. 

There are literate subjects who respond better to oral questions 
than to a written test of any type; and there are others who cannot 
be induced to give any oral response beyond an occasional whispered 
monosyllable but who will perform a written task with keen enjoy- 
ment. There are many subjects who flatly refuse to attempt the solu- 
tion of any arithmetical problem; and in an appreciable proportion of 
such cases the examiner is able to prove subsequently that the refusal 
was due to something other than actual inability to solve the prob- 
lems presented. (The distaste for arithmetic is a large factor in mak- 
ing composite school tests unsuitable for clinical use.) Some adult 
subjects who are sullen or irritable when given any task suggestive of 
school work will give excellent response to the challenge of a mechan- 
ical puzzle; while others will turn away contemptuously from any per- 
formance task on the ground that it is “child’s play”. It is only the 
exceptional subject, of any age, who responds equally well to all the 
different tasks. In order to rate the typical clinical subject with any 
approach to fairness, it is essential that the examiner be free to omit 
any particular task which fails to command a reasonable degree of 
cooperation. This is entirely possible when each task is evaluated 
as a unit for use in a battery. 

Obviously, the flexibility of the battery is its most important ad- 
vantage over the composite scale. 

The 1916 Stanford-Binet scale is the most rigid test which the 
writer has used to any considerable extent, and it is chiefly because 
of its rigidity that it has been found so unsatisfactory. If the ex- 
aminer were at liberty to omit a few unsuitable items, including the 
non-discriminative material at the ends of the subject’s range, it would 
be possible to use the scale with passable satisfaction in the large 
majority of cases. 

The climax of complexity is reached in the new Stanford-Binet, 
with all the rigidity of the earlier form still retained. If each item of 
this test which admits of being graded in difficulty had been made up 
in the form of an independent graded series, discriminative for the 
range of mental levels for which it is appropriate, the resulting col- 
lection of tests would offer a wealth of material sufficiently varied to 
contain something suitable for almost any subject who can be tested 
at all. This scale contains the raw material for a remarkably adequate 
system, but it is given us in a form which renders it inconvenient in 
all cases, wasteful of time in all cases, and invalid—in varying de- 
grees—in a very large proportion of clinical cases. It is in the clinic 
that we are especially in need of a flexible test that can be adapted to 
the individual subject; but the test which has been individually stand- 
ardized for clinical use has been made so inflexible that it is almost 
exceptional to have a subject for whom it can be used with satisfactory 
validity. 

IX. THE VOCABULARY TEST 


Of all the items included in the Stanford-Binet scale, the one 
which seems to the writer most strikingly to fall short of its possi- 
bilities is the vocabulary test. 

In the first place, 100 words (or 50 words) selected by rule from 
a dictionary of 18,000 words do not afford a large enough sampling to 
justify a conjectural estimate of the subject’s total vocabulary, nor is 
such estimate of any use in determining the subject’s “mental age”. 
Whether we accept the total-vocabulary estimate or not, we base the 
ten-year child's rating upon his thirty acceptable responses, not upon 
the 5400-word vocabulary which he is supposed to possess. And from 
the viewpoint of one who has no interest in an estimate based upon 
so meagre a sampling, the chance-selected list of words possesses no 
advantage over the much better word-list that might have been pre- 
pared. If we require the subject to give an oral definition of each 
word for which he may receive credit, there would be many advan- 
tages in using words which admit of being defined by the person who 
recognizes them—words which would elicit more uniform responses 
and which would permit more uniform evaluation. 

In the second place, the plan of having the words orally defined 
by the subject at all is open to serious objections: 


1. It involves the personal equation of the examiner both in pre- 
senting the words and in scoring the results. 

2. The request to define a word is unduly annoying to a large 
proportion of subjects; and the succession of failures with which the 
series is usually brought to a close makes it an instrument of torture 
to the extremely sensitive subject. 


3. It measures the subject’s willingness to attempt a definition, 
not invariably his actual ability to offer an acceptable response. Sub- 
jects differ widely in respect to the standards of certainty which seem 
to them to justify a response. It is sometimes the highly intelligent 
person who is most reluctant to offer anything short of a definition 
worthy of the dictionary, and who will decline to answer at all rather 
than attempt a crude explanation which does not satisfy his own 
standard of definition. Even at the five-year level a bright child has 
been known to offer no response to the question “what is a chair?” 
because—as was subsequently learned—it did not occur to her that a 
mere statement of use would be acceptable as a definition. On the 
other hand, there are children of low mentality who will respond 
promptly to every word on the list, perfectly satisfied to name any 
known word or even to coin a word of similar sound. 

There may be a proper place in a mental test for something that 
will measure a child’s self-confidence. We might use the vocabulary 
test as it stands, basing the score upon the number of attempted re- 
sponses instead of the number of acceptable responses. But if the 
purpose of the test is to determine which of these words the subject 
actually knows, it is unfair to employ a method which places so heavy 
a penalty upon a not unreasonable diffidence. 


4. For the typical clinical subject (as opposed to the school chil- 
dren upon whom tests are standardized) the request to define a word 
seems unnatural and wholly remote from everyday experience. We 
use words at all times, but only rarely have occasion to define them. 
Even the mature student, accustomed to enlarging his vocabulary by 
reading and only occasionally by consulting the dictionary, recognizes 
and uses many words which he would not venture offhand to define. 
A person unaccustomed to the use of the dictionary frequently feels 
bewildered when requested to define a word which has no familiar 
synonym. One who does not know how liberally the responses are 
scored may well feel trapped. 

The written multiple choice vocabulary test possesses the follow- 
ing advantages: 


1. It can be uniformly presented and objectively evaluated. 


2. The subject is less embarrassed by the word which he does not 
recognize, because his attention is immediately diverted from it by the 
challenge of the next line. Failure is not so open nor so humiliating. 


3. While it measures something different from that measured by 
oral definition, it measures more accurately what it does measure. 


4. Better and more uniform cooperation is obtained from the 
subjects, on the whole. It still holds that some subjects are more 
willing than others to hazard a conjecture—that some persons will 
guess when uncertain while others will omit the line. But the mul- 
tiple choice test is so built as to encourage a guess, and the diffident 
subject frequently responds to the suggestion. 


5. The written test admits of group standardization, and this 
makes possible more thorough standardization than is usually possible 
for an individually presented test. 

The vocabulary test is one of the very few tests that can be 
graded—not in one continuous scale but in a series of overlapping 
scales—from the three-year level up to the superior adult level. Be- 
cause of its wide range of applicability and also because of its recog- 
nized importance, the vocabulary test may be used to illustrate a 
possible method of having all mental levels covered by what is essen- 
tially a single unit. The scales suggested are as follows: 


1. Picture test, multiple choice, intended for pre-school children 
who can be induced to point but who cannot be induced to speak. 
A test of this type developed by Van Alstyne (22) shows the possi- 
bilities of the method. It is suggested that the pictures be in rows 
rather than in groups, with three, four and five pictures on a card 
about fourteen inches long. All the pictures should be very clearly 
drawn and well spaced. Those intended for the lowest age levels 
should be large, with only three pictures on the card. The size of the 
pictures may be reduced for the more difficult tasks which show four 
or five pictures for the child to choose from, but it is desirable at all 
levels to have all the pictures on a given card drawn to approximately 
the same scale. This test may be made discriminative from three years 
to five or six, and should be standardized from 2½ to seven years. 

2. Picture test, with single pictures to be named by the subject. 
This would be parallel to the multiple choice test, but intended for 
children who will speak. 


3. Picture test for higher mental levels, to be discriminative from 
five years up to about twelve. (One who has not attempted it has 
no right to state that it would be possible to carry it up to this level 
without introducing difficulty of visual perception and thus making 
it something quite other than a vocabulary test. The suggestion is 
offered, however, on the ground that such a test would meet a real 
need in the examination of non-readers.) This test might possibly be 
arranged for group standardization, with pages of pictures to be 
marked by the subject. At the early age levels the standardization 
should be individual. 


4. Verbal multiple choice test, covering the ages eight to twelve 
or thirteen, to be standardized from seven to fourteen. There should 
be no time limit. The forms should be printed in large, clear type. 
For children whose reading still falls far short of being automatic, 
four words to the line may be preferred to five words. Several forms 
of this test, with not over thirty items to the page, would be more 
useful than a longer test. 


5. More difficult multiple choice test, with five words to the line 
and 40-50 lines in the test. For group standardization in grades III 
to IX, to be used for ages ten to fourteen. This might well be stand- 
ardized both with and without a time limit. 


6. Upper level multiple choice test, to be standardized from grade 
VIII through college. 

It is important that the overlapping between adjacent scales be 
sufficient so that most mental levels shall be covered by more than 
one test. 


X. PROJECTS FOR DEVELOPMENT 


In seeking new material for tests one can make considerable use 
of the preliminary work done by others without any actual overlap- 
ping of elements; but one should be careful not to spoil a test already 
published by taking over selected elements from it. No two stand- 
ardized tests should be so much alike that the use of either for a given 
subject would render the other test of questionable validity for that 
subject. This is especially important with reference to pictures and 
verbal items. Aside from the danger of encroaching upon tests al- 
ready published, it is desirable that the tests at our command be as 
varied as possible. The ideal battery for any subject would consist 
of units having very weak intercorrelations and yielding a wide spread 
of individual ratings. 

There is a serious lack of non-juvenile tests for adults of the lower 
mental levels. Most tests intended for the eight year level are adapted 
to the interests of the children upon whom they are to be standard- 
ized rather than those of the older subjects for whom they are to be 
used. Adult subjects of the mental levels 6 to 9 years are apt to resent 
very keenly being treated as if they were children. It is only at the 
pre-school mental levels that tests intended for children can be appro- 
priately used for adults. 

There is a worthy project for the person who can draw. We 
need more and more picture tests, some specifically for children and 
some appealing to the interests of adults. There are many picture 
tests in use for classification of the children of first and second grades, 
but the writer cannot name one that is exactly what we need in clin- 
ical examinations. The pictures should be larger and more carefully 
drawn than those of most school tests, and it is essential that they be 
standardized without time limit. A separate sheet of paper for each 
test is much better than the booklet form, and it should be a full-size 
sheet. Tests for non-readers of the higher mental levels should be 
standardized up to eighth grade. Others, intended for small children, 
should be standardized for pre-school children as well as for the first 
two grades in school. 

In performance tests, as well as picture tests, there should be more 
overlapping of standardization between pre-school children and first 
grade children. There is at present much too sharp a break between 
the five-year and six-year tests. Toys have a very important place in 
tests for small children. Some of the tests standardized in nursery 
schools furnish very attractive material for the clinic child of four 
years, but there is a serious dearth of independently standardized toy 
material for the six-year level. It would be very helpful to the clinics 
if pre-school tests could be graded a little higher and standardized for 
the first two school grades. 

The directions test has been better developed at the top than at 
the lower levels. Three directions tests were offered by Woodworth 
and Wells (23) in 1911, one of which has been standardized by the 
writer (21). A few degrees more difficult than this is the last unit 
of the Kuhlmann-Anderson series (14). 

We are in need of an “easy directions” test something like the 
Woodworth-Wells tests of that name. It should be printed in large 
type on a full-size sheet of paper and adapted to the lower levels of 
reading ability. Standardization, without time limit, should be for 
grades II to VII. 

The directions test involving the marking of pictures, used freely 
in many school tests, is well worthy of being developed as an inde- 
pendent series. There might be several one-page units of varying 
degrees of difficulty; the lowest to be standardized for the first two 
grades plus pre-school children, and the highest (with pictures for 
adults) to be standardized up to eighth grade. Still another direc- 
tions test might be a development of the “three commissions” item. 
Such instructions as “Lay this card on the table” and “Take one block 
in each hand and hold them both up” could be followed at an early 
age level, and the test could be graded up to six or seven years. This, 
of course, would call for individual presentation. For uniformity of 
standardization and for convenience in clinical use, it would be well 
to limit the directions to such as the child can obey without leaving 
his chair. 

A graded series of mixed sentences as a performance test was 
formulated by the writer some years ago but not standardized. The 
words were printed in plain type on cards about one inch in width 
and of varying length. Presenting movable words greatly increases 
the attractiveness of the mixed-sentence test, but it also introduces a 
chance factor not present when the words are shown in fixed order. 
To offset the effect of chance, it is necessary to use a much larger 
number of sentences. The first sentences used were “the baby is 
asleep” and “the cat has four kittens”. Children of seven years as- 
sembled these sentences delightedly, and older subjects found consid- 
erable pleasure in assembling the longer sentences. The test can be 
recommended for standardization, but one should start with at least 
twice as many sentences as are to be used. It is necessary to have a 
time limit for each sentence, as otherwise the subject will work almost 
indefinitely on a too-difficult task. 

Another language-performance test which fell by the wayside is 
based on the well-known game of word-building. Eleven letters were 
used, A B C D G I L O R T Y. The task was to make as many 
words as possible within five minutes, and the presentation was ex- 
actly the same for subjects of all ages. It is not necessary to use so 
many letters; but it is well to give enough letters so that the child 
will have something to show for his five minutes’ hard work, and it is 
important to use letters which spell familiar first-reader words. These 
letters were selected so as to include the words: cat, dog, boy, girl. 
This test may be group standardized, for children who are old enough 
to write legibly. A little additional time should be allowed for 
writing the words. 
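
The checking of responses on the word-building test may be illustrated by a short Python sketch. It assumes one tile for each of the eleven letters, so that no letter can be used twice within a single word, and it assumes that the tiles are available afresh for every word. Both assumptions, and the function name, belong to the sketch and not to the original procedure.

```python
from collections import Counter

# The eleven letters of the test, one tile of each (assumed).
LETTERS = Counter("ABCDGILORTY")


def can_build(word):
    """True if the word can be spelled from the available tiles."""
    need = Counter(word.upper())
    # A Counter returns 0 for a letter that is not among the tiles.
    return all(LETTERS[ch] >= n for ch, n in need.items())


for w in ["cat", "dog", "boy", "girl", "ball"]:
    print(w, can_build(w))
```

The four first-reader words named in the text are all accepted, while a word such as "ball" is rejected because it calls for a second L.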

If anyone can devise a non-language, manipulative series that 
can be scored by something other than speed, such a test would be 
most welcome. The speed record does not tell the whole story in any 
performance test, but it is usually the only record that can be taken 
objectively. Move counting, with passable accuracy, is practically 
impossible for most performance tasks. In any event, speed is almost 
of necessity too important a factor to be left unrecorded. If both 
speed and moves be recorded, the two variables should be used sepa- 
rately for two sets of norms. Any attempt to penalize useless moves 
in terms of speed serves only to impair the value of both variables; but 
if treated independently, they permit the test to be used for two 
units of a battery instead of one. 

A graded series of picture puzzles would make a test very service- 
able for subjects not practiced in assembling jig-saw puzzles, but it 
must be acknowledged that suitable pictures are not easily found. 
Clear-cut lines and strong color contrasts are important, and it is de- 
sirable also that the pictures have some artistic merit. Each picture 
should be cut into rectangles of equal size. On no account should 
irregular cuttings be used, because this would encourage the subject 
to match the pieces by form instead of by the picture. We are 
already supplied with form boards covering a wide range of mental 
levels, and it is for measuring something other than form perception 
that we need a picture puzzle test. The series may begin with the 
picture of a child’s face, cut in two through the center. As a check 
on chance success in matching the two pieces, there should be more 
than one of the two-piece puzzles, approximately equal in difficulty. 
These may be presented repeatedly, alternately or in rotation, until 
purposeful matching is indicated or excluded. No timing is necessary 
at the lowest level, as it is merely a question of whether the child can 
or cannot match the pieces. For the rest of the series puzzles of four, 
six, ten, twenty and thirty pieces are suggested. The time should be 
recorded for the four-piece one. Speed becomes increasingly signifi- 
cant as the series progresses. The thirty-piece puzzle, scored by speed, 
is discriminative at the adult level for an unpracticed subject. Being 
a test of special aptitude, this test should not be used except as one 
unit of a battery. The aptitude is sometimes well developed at four 
years. 

In developing tests which are intended to measure a person’s 
actual capacity along some line, not merely his acquiescence or his 
responsiveness to the examiner, it is important to keep always in mind 
the attractiveness of the material from the subject’s point of view. In 
spite of our best efforts the verbal material will remind some subjects 
of something painful, but at least we should avoid all reference to 
anything that is generally or necessarily distressing. There should be 
a great enough variety of tests to conform to the varied tastes of the 
subjects, and a sufficient number so that it will not be necessary to 
require the subject who hates arithmetic to perform a number series 
task. 

It is by no means desirable that presentation of tests should be 
made mechanical, but the difficulty of presentation might well be 
reduced sufficiently to insure a reasonable degree of uniformity. Eval- 
uation of results should be as objective as possible, in the interests of 
accuracy. 


SUMMARY 


The clinic needs a more flexible test system than any edition of 
the Binet-Simon scale yet offered. Clinical subjects are highly indi- 
vidual and varied in their interests and tastes. Any given test-item 
is sure to be found inapplicable to some subjects. A composite scale 
which has to be used as a whole may be grossly unfair to some par- 
ticular subject. 

Any test standardized by the age-grade method is needlessly 
wasteful of time in presentation. The time allowed for psychometric 
examinations in clinics is usually limited and frequently inadequate. 
Too much of this valuable time is being wasted by the presentation of 
non-discriminative material at the upper and lower ends of the sub- 
ject’s natural range. It is a waste of time to use items which add 
nothing to the adequacy of the examination, merely to satisfy a formal 
requirement. 

It is recommended that language test material be developed ac- 
cording to the method used by Pintner and Paterson for a group of 
performance tests. Any item which can be graded in difficulty may 
be developed into a graded series and standardized as an independent 
unit. Each unit should be so graded as to cover the entire range of 
mental levels for which it can appropriately be used; but for economy 
of presentation, the standardization should be for overlapping sec- 
tions rather than for the series as a whole (or in addition to the 
standardization for the series as a whole). If independent norms be 
published for each section, the non-discriminative material to be used 
in a given examination may be reduced to a negligible quantity. 


When sufficient test-units have been thus developed and stand- 
ardized, an examiner may make up a battery by selecting for each 
case such units as are individually suited to the subject. The exam- 
ination will be custom-made to fit each subject, instead of the subject 
being held responsible for fitting the test. 

The battery of tests used in a given examination will yield a 
series of independent ratings, the median of which may be placed on 
record for reference. When a one-figure numerical rating is required, 
this median rating is the figure to be reported. 


A fixed scale, standardized as a whole and necessarily used as a 
whole, becomes less satisfactory year by year as its items pass out of 
date. A loose collection of independently standardized tests, on the 
other hand, comprises a growing system which possesses indefinite 
possibilities for growth. Any unit which is found unserviceable may 
be revised or superseded at any time, without disturbing the other 
units. Thus the system is at all times in process of being rejuvenated. 


Any student who has resources for developing a given unit may 
make his contribution to the collection of tests from which clinical 
workers of the future may draw their material for any examination. 
Every student who is working on tests may have a share in a system 
which will be the product of many minds. 